Abstract. In the online prediction framework, we use generalized entropy
to study the loss rate of predictors when outcomes are drawn
according to stationary ergodic distributions over the binary alphabet. 
We show that the notion of generalized entropy of a regular game [10] is 
well-defined for stationary ergodic distributions. In proving this, we obtain
new game-theoretic proofs of some classical information-theoretic inequalities.
Using Birkhoff's ergodic theorem and convergence properties
of conditional distributions, we prove that a classical Shannon-McMillan- 
Breiman theorem holds for a restricted class of regular games, when no 
computational constraints are imposed on the prediction strategies. 
If a game is mixable, then there is an optimal aggregating strategy which 
loses at most an additive constant when compared to any other lower 
semicomputable strategy. The loss incurred by this algorithm on an infi- 
nite sequence of outcomes is called its predictive complexity. We use our 
version of the Shannon-McMillan-Breiman theorem to prove that when a
restricted regular game has a predictive complexity, the predictive complexity
converges to the generalized entropy of the game almost everywhere
with respect to the stationary ergodic distribution. 



1 Introduction 

We consider the online prediction question studied by [15], [16], [10], [7], [5] in the
setting of a stationary stochastic process. In this setting, we have a sequence of 
outcomes $x_0, x_1, \ldots$ from a finite alphabet. A predictor, given the history up to
a certain index, predicts what the next outcome will be. We allow the predictor 
to present its prediction as a convex combination which represents the weight 
it assigns to each outcome in the alphabet. The game proceeds by revealing the 
next outcome, and then asking for the prediction of the future outcome. For an 
overview of this area, see [2]. Independently, Merhav and Feder [12], Feder [4] and
Feder et al. [5] have studied the question of optimal finite-state predictors with
respect to Shannon entropy, in the setting of stationary Markov Chains. It is 
known that the log-loss game characterizes Shannon entropy. The present line of 
work generalizes their approach in two ways - first, in considering loss functions 
besides log-loss, and second, in considering optimal processes over stationary 
ergodic distributions. 



A natural question in this context is how well the predictor is doing as the 
game progresses. We measure the discrepancy between the actual outcome and 
the predicted one, with a loss function. This helps us to ask whether optimal 
predictors exist - those which incur at most the same loss as any other predictor
on any outcome, ignoring additive constants. Indeed if such an optimal
predictor exists, we can use its loss rate on a particular sequence of outcomes to 
define its inherent predictability (see for example, [15], [16]).

Besides competitive advantage above other predictors, we can also charac- 
terize the performance of an optimal predictor by examining its expected loss 
assuming the outcomes are drawn from a particular distribution. Prior work by 
Kalnishkan et al. [10] establishes that if the outcomes are drawn independently
according to a Bernoulli distribution on the alphabet, then the expected loss 
rate of an optimal predictor is the generalized entropy [8] of the loss function. In
this paper, we extend this result to the important setting of stationary ergodic 
distributions. 

The contributions of our paper are threefold. 

1. First, we show that the generalized entropy rate of a stationary ergodic 
process is well-defined, if the game is regular. We provide "game-theoretic" 
proofs of classical information-theoretic inequalities, giving new intuitive 
proofs even in the special case of the Shannon entropy. This constitutes
Sections 3 and 4 of the paper.

2. Second, under a continuity and an integrability constraint, we show that optimal
strategies exist for regular games (see footnote 1). We show that the loss rate incurred
by such a strategy is the generalized entropy rate of the stationary ergodic 
process. This is a Shannon-McMillan-Breiman theorem for generalized en- 
tropy. This result is new, and we provide a proof using Vitali Convergence. 
This constitutes section 5 of the paper. 

3. Using the above results, we show that when a game has predictive complexity, 
an optimal aggregator algorithm attains the entropy rate of the game. 
The proof that the aggregator incurs at most the entropy rate of loss crucially 
uses our Shannon-McMillan-Breiman Theorem. 

The proof that the aggregator incurs at least the entropy rate of loss uses 
some properties of stationary ergodic processes that we prove in Sections 3 
and 4. This constitutes the final section of the paper. 

2 Preliminaries 

As defined in [10], a game $\mathcal{G}$ is a triple $(\Sigma, \Gamma, \lambda)$ where $\Sigma$ is a finite alphabet
space, $\Gamma$ is the space of predictions and $\lambda : \Sigma \times \Gamma \to [0, \infty]$ is the loss function,
to be defined below. We will only consider the binary alphabet in this paper. 

1 There is an independent characterization of games with optimal strategies in terms 
of convexity of loss-regions [9] . We deal with this approach in the final section of our 
paper. 



Intuitively, we model a predictor function which, given the string of outcomes 
so far, will predict the next outcome. We consider a slightly general framework 
where the predictor does not have to necessarily predict only one outcome. It is 
allowed to output a point $(p_0, p_1) \in P^2$ (equivalently, a probability vector, where
$p_0$ is the predicted probability that the next bit is 0, and $p_1$, the probability that
the next bit is 1). The game proceeds by revealing the next outcome. Let this
outcome be $b$. The prediction strategy is said to incur the loss $\lambda(b, (p_0, p_1))$.

As is customary, we adopt the notation N for the set of natural numbers, 
starting from 0. The set of strings of length $n$ is denoted $\Sigma^n$. The set of finite
binary strings is denoted $\Sigma^*$ and the set of infinite binary sequences is denoted
$\Sigma^\infty$. For a finite or an infinite sequence $x$, the notation $x_i^j$ denotes $x_i \ldots x_j$. If
$x$ is shorter than $n$ bits, $x_0^{n-1}$ denotes $x$ itself. If $x$ is a finite string, and $\omega$ is a
finite string or an infinite sequence, then $x \cdot \omega$ denotes the result of concatenating
$\omega$ to $x$. For each natural number $i$, let $\Pi^i$ be the class of all functions mapping
$i$-long strings to $\Gamma$.

We call a family of functions $\wp$ a strategy if $\forall i \in \mathbb{N}$, $|\wp \cap \Pi^i| = 1$, i.e., there is a
unique function which takes an $i$-length string as input and produces a prediction
based on the input. We call that function $\wp^i$. Thus the prediction strategy is a
non-uniform family. We impose no computational constraints until the final part 
of the paper. 

3 Loss functions 

The generalized entropy of a game is defined in terms of convex loss functions 
described above. We define the losses incurred by a strategy on a finite string w 
of outcomes, as the cumulative loss that it incurs on each bit of w. This follows 
the definition given in [10] and [9]. We generalize the notion slightly to deal with
the expected loss that a strategy incurs with respect to a stationary distribution. 

Definition 1. The loss that a prediction strategy $\wp$ incurs on a finite string $w$
of outcomes is defined to be

$$\mathrm{Loss}(w, \wp) = \sum_{i=0}^{|w|-1} \lambda\left(w_i, \wp^i(w_0^{i-1})\right).$$
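Definition 1 accumulates, bit by bit, the loss of the prediction made from the history seen so far. The following Python lines are a minimal sketch of that bookkeeping, not part of the paper: the names `cumulative_loss`, `square_loss` and `laplace_strategy` are hypothetical, the square loss is used only as a convenient bounded example, and the strategy shown (Laplace's rule of succession) is an arbitrary illustrative choice.

```python
from typing import Callable, Sequence


def square_loss(b: int, gamma: float) -> float:
    """Illustrative loss function lambda(b, gamma) = (b - gamma)^2."""
    return (b - gamma) ** 2


def laplace_strategy(history: Sequence[int]) -> float:
    """Hypothetical strategy: predicted weight of bit 1 by Laplace's rule."""
    return (sum(history) + 1) / (len(history) + 2)


def cumulative_loss(
    w: Sequence[int],
    strategy: Callable[[Sequence[int]], float],
    loss: Callable[[int, float], float],
) -> float:
    """Sum over i of loss(w[i], strategy(w[0..i-1])), as in Definition 1."""
    return sum(loss(w[i], strategy(w[:i])) for i in range(len(w)))


# Predictions on 1, 1, 0 are 1/2, 2/3, 3/4, so the loss is 1/4 + 1/9 + 9/16.
total = cumulative_loss([1, 1, 0], laplace_strategy, square_loss)
print(total)
```

Passing the strategy as a function of the whole history mirrors the non-uniform family $\wp^i$: the prediction for position $i$ may depend on all of $w_0^{i-1}$.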



In order to study when a strategy is better than another, we study the average 
loss it incurs, when outcomes are drawn from a stationary distribution. We 
consider the strategy which incurs the minimal expected loss on a particular set, 
if such a strategy exists. Let $(\Sigma^\infty, \mathcal{F}, P)$ be the probability space where $\mathcal{F}$ is the
Borel $\sigma$-algebra generated by cylinders

$$C_x = \{\omega \in \Sigma^\infty \mid x \text{ is a prefix of } \omega\}$$

for all finite strings $x$, and $P : \mathcal{F} \to [0, 1]$ is the probability measure.

Let $X = (X_0, X_1, \ldots)$ be a sequence of random variables on the probability
space; for each $i \in \mathbb{N}$, $X_i$ maps $\Sigma^\infty$ to $\mathbb{R}$. For $k \geq 1$, let $S_k X$ denote the sequence
$(X_k, X_{k+1}, \ldots)$, that is, $X$ "shifted left" $k$ times.



Definition 2. A sequence of random variables $X$ is stationary if the probabilities
of $S_k X$ and $X$ coincide for every $k \geq 1$. That is, for every Borel set $B$
in the $\sigma$-algebra over $\mathbb{R}^\infty$,

$$P(X \in B) = P(S_k X \in B).$$

We could also use the terminology of measure-preserving transformations
to capture stationarity. A transformation $T : \Omega \to \Omega$ is said to be measure-preserving
if for every $A \in \mathcal{F}$, $P(T^{-1}A) = P(A)$. A measure-preserving transformation
is said to be ergodic if every $A \in \mathcal{F}$ with $T^{-1}(A) = A$ has $P(A)$ equal to
either 0 or 1.

The class of stationary processes corresponds almost exactly to the class of
probability spaces $(\Omega, \mathcal{F}, P, T)$ where $T : \Omega \to \Omega$ is a $P$-measure-preserving
transformation. For $k \in \mathbb{N}$, let $T^k$ denote the $k$-fold composition of $T$ with itself.
It is easy to see that if $T$ is measure preserving and $X_0$ is a random
variable, then $(X_0, X_0 \circ T, X_0 \circ T^2, \ldots)$ is a stationary sequence. We also have
the converse.

Lemma 1 ([13]). For every stationary sequence $X$ on a probability space $(\Omega, \mathcal{F}, P)$,
there is a probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$, a random variable $\tilde{X}_0$ and a $\tilde{P}$-measure
preserving transformation $\tilde{T} : \tilde{\Omega} \to \tilde{\Omega}$ such that the distribution of
$(\tilde{X}_0, \tilde{X}_0 \circ \tilde{T}, \tilde{X}_0 \circ \tilde{T}^2, \ldots)$ coincides with the distribution of $X$.

On an alphabet space, we are interested in the coordinate random variables
$X_i(\omega) = \omega_i$ ($i \in \mathbb{N}$), and any probability distribution such that $X$ is stationary
with respect to it will be called a stationary distribution. A probability space
with respect to which the left-shift transformation is ergodic will be called an
ergodic distribution.

Definition 3. We define the n-step generalized entropy of the game to be

$$H_n = \inf_{\wp} \sum_{w \in \Sigma^n} P(w)\, \mathrm{Loss}(w, \wp), \qquad (1)$$

where $(\Sigma^\infty, \mathcal{F}, P)$ is a stationary probability space.

In order to avoid degenerate games (for example, games where the least
expected loss is infinity, precluding any incentive to play the game), Kalnishkan
et al. [10] restrict the game in the following manner.

- We restrict $\Gamma$ to be a compact space. For the binary alphabet space, the
prediction space is [0, 1].

- The loss function $\lambda$ is an extended real-valued convex function on $\Sigma \times \Gamma$.
We take the discrete topology on the alphabet and the standard topology
on [0, 1]. Then $\lambda$ is continuous with respect to their product topology.

- There is a prediction $\gamma \in \Gamma$ such that for every $b \in \Sigma$, the inequality
$\lambda(b, \gamma) < \infty$ holds. This property ensures that the n-ary entropy is a finite
quantity.

- If there is a $\gamma_0 \in \Gamma$ such that for some $b \in \Sigma$, the loss $\lambda(b, \gamma_0) = \infty$, then there
is a sequence $\gamma_1, \gamma_2, \ldots \to \gamma_0$ such that for each $\gamma_i$, we have $\lambda(b, \gamma_i) < \infty$.

A game which obeys these conditions is said to be regular. The last condition 
is necessary (but not sufficient) to ensure that predictive complexity exists for 
the game. We need this property crucially in Theorems @] and [5] 

The n-step generalized entropy is the least expected loss incurred by any
strategy on $\Sigma^n$. Since $\Sigma^n$ is a compact space and $\lambda$ is continuous in both its
arguments, the infimum in the above expression is attained by some strategy (see footnote 2).

Example 1. The Log-Loss game: Consider the binary alphabet and let predictions
be values in [0,1]. Let $p_0$ and $p_1$ be the probability of the bit 0 and bit 1,
respectively.

Suppose we define the loss function by $\lambda(b, \gamma) = -\log(|\bar{b} - \gamma|)$, where $b$ is a
bit, $\bar{b}$ its complement, and $\gamma \in [0, 1]$. Then the minimal expected loss over one
bit is obtained at $\gamma = p_1$, ensuring that $H(p_1)$ is the Shannon entropy of the
distribution. (End of Example)
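The claim in Example 1 can be checked numerically. The sketch below is illustrative only: it assumes base-2 logarithms, a hypothetical value $p_1 = 0.3$, and a grid search over $\gamma$ in place of the exact minimization; the function names are not from the paper.

```python
import math


def log_loss(b: int, gamma: float) -> float:
    """lambda(b, gamma) = -log2(|b_bar - gamma|), gamma being the weight on bit 1."""
    mass = gamma if b == 1 else 1.0 - gamma
    return math.inf if mass == 0.0 else -math.log2(mass)


def expected_loss(p1: float, gamma: float) -> float:
    """Expected one-bit loss when the bit is 1 with probability p1."""
    return (1.0 - p1) * log_loss(0, gamma) + p1 * log_loss(1, gamma)


def shannon_entropy(p1: float) -> float:
    return -(p1 * math.log2(p1) + (1.0 - p1) * math.log2(1.0 - p1))


p1 = 0.3  # hypothetical probability of bit 1
grid = [k / 1000 for k in range(1, 1000)]  # interior grid, avoids infinite losses
best_gamma = min(grid, key=lambda g: expected_loss(p1, g))
best_loss = expected_loss(p1, best_gamma)
print(best_gamma, best_loss, shannon_entropy(p1))
```

On this grid the minimizer coincides with $p_1$ and the minimal expected loss agrees with the Shannon entropy of the distribution, as the example states.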

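Definition 4. The generalized conditional entropy of $\Sigma^n$ given $\Sigma^m$ is defined
as

$$H_{n|m} = \inf_{\wp} \sum_{w \in \Sigma^m} P(w) \sum_{x \in \Sigma^n} P\{x \mid w\} \sum_{i=0}^{n-1} \lambda\left(x_i, \wp^{i+m}(w \cdot x_0^{i-1})\right)$$

$$= \inf_{\wp} \sum_{w \in \Sigma^m,\, x \in \Sigma^n} P(wx) \sum_{i=0}^{n-1} \lambda\left(x_i, \wp^{i+m}(w \cdot x_0^{i-1})\right).$$

This is an analogue of the definition of conditional Shannon entropy. The
inner term in Definition 4 can also be expressed as follows.

$$\sum_{i=0}^{n-1} \lambda\left(x_i, \wp^{i+m}(w \cdot x_0^{i-1})\right) = \mathrm{Loss}(wx, \wp) - \mathrm{Loss}(w, \wp).$$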

When we generalize the theory to handle arbitrary loss functions, we do lose 
some ideal properties that Shannon entropy has. The following theorem states 
that Shannon entropy is the unique function having certain ideal properties that 
we desire in a measure of information.

Theorem 1. Suppose $F$ is a continuous function mapping n-dimensional probability
distributions to [0, 1] having the following properties.

1. For any random variables $A$ and $B$, $F(AB) = F(A) + F(B \mid A)$.

2. The n-dimensional uniform distribution has the largest entropy among n-dimensional
distributions.

3. $F(p_1, p_2, \ldots, p_n, 0) = F(p_1, p_2, \ldots, p_n)$.

Then there is a positive constant $A$ such that for every n-dimensional probability
vector $(p_1, \ldots, p_n)$, $H(p_1, p_2, \ldots, p_n) = A\,F(p_1, p_2, \ldots, p_n)$.

2 The authors remark in [10] that such a strategy need not exist for $\Sigma^*$.

With our definition of the cumulative loss, we can establish the chain rule 
for generalized entropy. 

Lemma 2. For all positive natural numbers $m$ and $n$, we have $H_{m+n} = H_m + H_{n|m}$.

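Proof. In Definition 4, $\wp^i$ for $0 \leq i < m$ does not play any role in the infimum,
and likewise in Definition 3 (applied to $H_m$), $\wp^i$ for $i \geq m$ does not play any role
in the infimum. This observation allows us to deduce that

$$H_m + H_{n|m} = \inf_{\wp} \left( \sum_{w \in \Sigma^m} P(w) \sum_{x \in \Sigma^n} P\{x \mid w\} \sum_{i=0}^{n-1} \lambda\left(x_i, \wp^{i+m}(w \cdot x_0^{i-1})\right) \right) + \inf_{\wp} \sum_{w \in \Sigma^m} P(w)\, \mathrm{Loss}(w, \wp)$$

$$= \inf_{\wp} \left( \sum_{w \in \Sigma^m} P(w) \sum_{x \in \Sigma^n} P\{x \mid w\} \sum_{i=0}^{n-1} \lambda\left(x_i, \wp^{i+m}(w \cdot x_0^{i-1})\right) + \sum_{w \in \Sigma^m} P(w)\, \mathrm{Loss}(w, \wp) \right). \qquad (2)$$

Now, the right-hand side of (2) equals

$$\inf_{\wp} \sum_{w \in \Sigma^m} P(w) \left( \mathrm{Loss}(w, \wp) + \sum_{x \in \Sigma^n} P\{x \mid w\} \sum_{i=0}^{n-1} \lambda\left(x_i, \wp^{i+m}(w \cdot x_0^{i-1})\right) \right)$$

$$= \inf_{\wp} \sum_{w \in \Sigma^m} P(w) \sum_{x \in \Sigma^n} P\{x \mid w\} \left( \mathrm{Loss}(w, \wp) + \sum_{i=0}^{n-1} \lambda\left(x_i, \wp^{i+m}(w \cdot x_0^{i-1})\right) \right)$$

$$= \inf_{\wp} \sum_{\omega \in \Sigma^{m+n}} P(\omega)\, \mathrm{Loss}(\omega, \wp) = H_{m+n}.$$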

Since $\lambda$ is non-negative, it is clear that all entropies defined so far are non-negative.
An immediate consequence of this is $H_{m+n} \geq H_m$ for all $m, n > 0$. We
see that this style of proof referring to strategies in games yields new intuitive
proofs of such inequalities.
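The chain rule can be illustrated on a small example. The sketch below makes several assumptions that are not in the paper: a hypothetical two-state stationary Markov chain (`T`, `PI`), the square-loss game, and the fact that for square loss the conditional probability of bit 1 is an optimal prediction, so the infima are evaluated at that strategy rather than searched for.

```python
from itertools import product

T = [[0.9, 0.1], [0.4, 0.6]]  # hypothetical transition probabilities
PI = [0.8, 0.2]               # stationary distribution of T


def prob(w):
    """P(w) for a finite binary string w (a tuple) under the stationary chain."""
    if not w:
        return 1.0
    p = PI[w[0]]
    for a, b in zip(w, w[1:]):
        p *= T[a][b]
    return p


def predict(history):
    """Conditional probability that the next bit is 1, given the history."""
    return prob(history + (1,)) / prob(history)


def loss(w, start=0):
    """Square loss of `predict` on the bits w[start:], each predicted from its full past."""
    return sum((w[i] - predict(w[:i])) ** 2 for i in range(start, len(w)))


def entropy(n):
    """H_n: expected cumulative loss on strings of length n."""
    return sum(prob(w) * loss(w) for w in product((0, 1), repeat=n))


def cond_entropy(n, m):
    """H_{n|m}: expected loss on the n bits following an m-bit history."""
    total = 0.0
    for w in product((0, 1), repeat=m):
        for x in product((0, 1), repeat=n):
            total += prob(w + x) * loss(w + x, start=m)  # P(w) P{x|w} = P(wx)
    return total


m, n = 1, 2
print(entropy(m + n), entropy(m) + cond_entropy(n, m))
```

Both printed values agree up to rounding, and the first is at least `entropy(m)`, matching the inequality $H_{m+n} \geq H_m$.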

Since conditions 1 and 3 in Theorem 1 are satisfied, Khinchin's uniqueness
theorem therefore leads us to conclude that with a generalized entropy, the 
uniform distribution need not have maximal entropy - for example, the square- 
loss is not maximized at the uniform distribution. 



4 Entropy of a Regular Game 

The goal of this section is to define the notion of the entropy of a regular game. 
Our idea is to define it to be the limiting rate of the n-step generalized entropies 
of the game. We now show that if the game is regular and the probability dis- 
tribution is stationary, such a limit exists. Thus the notion of the entropy of a 
regular game is well-defined. 



Lemma 3. [Generalized Shannon Inequality] For any regular game and non-negative
integers $m$ and $n$, we have $H_{m|n} \leq H_m$.

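Proof. The following proof is for $m = 1$. In this special case
$H_1 = \inf_{\gamma \in \Gamma} \sum_{a \in \Sigma} P(a)\lambda(a, \gamma)$ and

$$H_{1|n} = \inf_{f \in \Pi^n} \sum_{w \in \Sigma^n} P(w) \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)) = \inf_{f \in \Pi^n} \sum_{a \in \Sigma} P(a) \sum_{w \in \Sigma^n} P\{w \mid a\}\, \lambda(a, f(w)).$$

Now pick the $\gamma \in \Gamma$ which attains $H_1$. We can do this because the regularity
condition of the game requires $\Gamma$ to be compact. The loss function is continuous in
both its arguments, ensuring that the expected loss in (1) is a continuous function
on a compact space. Now define $f' : \Sigma^n \to \{\gamma\}$. Clearly, $f' \in \Pi^n$. So,

$$H_{1|n} \leq \sum_{a \in \Sigma} P(a) \sum_{w \in \Sigma^n} P\{w \mid a\}\, \lambda(a, f'(w)) = \sum_{a \in \Sigma} P(a) \sum_{w \in \Sigma^n} P\{w \mid a\}\, \lambda(a, \gamma) = \sum_{a \in \Sigma} P(a)\lambda(a, \gamma) = H_1.$$

The general case proceeds by induction by defining $f^{n+i}(w \cdot \omega_0^{i-1}) = f^i(\omega_0^{i-1})$,
where $w$ is an n-long string and $1 \leq i < m$.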

In the special case of the log-loss game with a Bernoulli distribution on the
finite alphabet, the argument above yields a new proof of the Shannon
inequality.

Lemma 4. For any regular game, any stationary distribution $P$ defined on it,
and any pair of positive natural numbers $m$ and $n$, $H_{m|n} \geq H_{m|n+1}$.

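Proof. We prove the inequality for $m = 1$. The general case would follow from
application of Lemma 2. We have,

$$H_{1|n} = \inf_{f \in \Pi^n} \sum_{w \in \Sigma^n} P(w) \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)) = \inf_{f \in \Pi^n} \sum_{a \in \Sigma} \sum_{w \in \Sigma^n} P\{wa\}\, \lambda(a, f(w))$$

and similarly $H_{1|n+1} = \inf_{f' \in \Pi^{n+1}} \sum_{a \in \Sigma} \sum_{w \in \Sigma^{n+1}} P\{wa\}\, \lambda(a, f'(w))$.

We show that for each $f \in \Pi^n$ we have an $f' \in \Pi^{n+1}$ which matches the inner
quantity on which the infimum is taken. Then, by taking the infimum over $\Pi^{n+1}$, we
would have $H_{1|n} \geq H_{1|n+1}$. Fix an $f \in \Pi^n$ and consider $f' \in \Pi^{n+1}$ defined as
$f'(bw) = f(w)$ for all $w \in \Sigma^n$, $b \in \Sigma$. Now,

$$\sum_{a \in \Sigma} \sum_{w \in \Sigma^{n+1}} P\{wa\}\, \lambda(a, f'(w)) = \sum_{a \in \Sigma} \sum_{b \in \Sigma} \sum_{w' \in \Sigma^n} P\{bw'a\}\, \lambda(a, f'(bw'))$$

$$= \sum_{a \in \Sigma} \sum_{b \in \Sigma} \sum_{w' \in \Sigma^n} P\{bw'a\}\, \lambda(a, f(w'))$$

$$= \sum_{a \in \Sigma} \sum_{w' \in \Sigma^n} P\{w'a\}\, \lambda(a, f(w')),$$

where the last step follows from stationarity of $P$ (i.e., $\sum_{b \in \Sigma} P\{bw\} = P\{w\}$ for
all $w \in \Sigma^*$).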

Theorem 2. For any regular game $\mathcal{G}$ and stationary $(\Sigma^\infty, \mathcal{F}, P)$, $\lim_{n \to \infty} \frac{H_n}{n}$ exists
and is finite.

Proof. From the regularity condition, we get that $H_1$ is finite. From Lemma 2 it
follows that $H_n = \sum_{i=0}^{n-1} H_{1|i}$.

By Lemma 4, $H_{1|k} \geq H_{1|(k+1)}$. Since entropies are non-negative, the sequence
$\{H_{1|n}\}$ is a bounded, monotone decreasing sequence of reals. Hence, it has a limit
which we denote by $H_{1|\infty}$. It also follows that $H_{1|\infty}$ is at most $H_1$.

So by the Cesàro mean, $\lim_{n \to \infty} \frac{H_n}{n} = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} H_{1|i} = \lim_{n \to \infty} H_{1|n} = H_{1|\infty}$.
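The two facts used in this proof, monotonicity of the one-step conditional entropies and the Cesàro averaging step, can be observed numerically. The sketch below is only an illustration under stated assumptions: the process is a hypothetical stationary hidden Markov source (a two-state chain observed through a bit-flip channel, with parameters `T`, `PI`, `FLIP` chosen arbitrarily), the game is square loss, and the one-bit minimal expected square loss is taken to be $p(1-p)$ for conditional probability $p$ of bit 1.

```python
from itertools import product

T = [[0.9, 0.1], [0.4, 0.6]]  # hypothetical hidden-state transitions
PI = [0.8, 0.2]               # stationary distribution of T
FLIP = 0.1                    # probability that an observed bit differs from the state


def prob(w):
    """P(w) for an observed string w, computed with the forward recursion."""
    alpha = list(PI)  # alpha[s]: P(observations so far, current hidden state = s)
    for b in w:
        joint = [alpha[s] * ((1 - FLIP) if s == b else FLIP) for s in (0, 1)]
        alpha = [sum(joint[s] * T[s][t] for s in (0, 1)) for t in (0, 1)]
    return sum(alpha)


def one_step_entropy(n):
    """H_{1|n} for square loss: expected value of p(1 - p) over n-bit histories."""
    total = 0.0
    for w in product((0, 1), repeat=n):
        pw = prob(w)
        p1 = prob(w + (1,)) / pw  # conditional probability of bit 1 after w
        total += pw * p1 * (1.0 - p1)
    return total


steps = [one_step_entropy(n) for n in range(8)]            # H_{1|0}, ..., H_{1|7}
rates = [sum(steps[:n]) / n for n in range(1, 9)]          # H_n / n for n = 1..8
for n, (h, r) in enumerate(zip(steps, rates), start=1):
    print(n, round(h, 6), round(r, 6))
```

The first printed column of entropies is non-increasing, and each rate $H_n/n$ is sandwiched between the latest one-step entropy and $H_1$, which is the mechanism behind the Cesàro argument.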

Definition 5. Let $\mathcal{G} = (\Sigma, \Gamma, \lambda)$ be a regular game and $(\Sigma^\infty, \mathcal{F}, P)$ be a stationary
distribution. Then the generalized entropy of the game is defined as

$$H = \lim_{n \to \infty} \frac{H_n}{n}.$$



5 A Shannon-McMillan-Breiman Theorem 

We now show that for regular games with a suitable restriction on the loss 
functions, optimal processes exist and they attain the generalized entropy rate 
of the stationary ergodic process. Our approach to this result is through uniform 
integrability and the Vitali Convergence theorem, which contrasts with the usual 
approach using the Dominated Convergence Theorem. First, we define the notion 
of a strongly regular game, for which the result holds (see footnote 3). We will derive two
consequences of strong regularity, viz.

1. The existence of a limiting function for the loss function, P-almost everywhere.

2. The integrability of this limiting function.

We utilize these in the proof of the Shannon-McMillan-Breiman Theorem. We
conclude with two examples, illustrating that Theorem 4 properly generalizes
the classical Shannon-McMillan-Breiman theorem.

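Definition 6. Let $(\Omega, \mathcal{F}, P)$ be a probability space. A sequence of functions
$\{f_n\}_{n=1}^{\infty}$ is called uniformly integrable if

$$\lim_{\alpha \to \infty} \sup_n \int |f_n|\, I_{[|f_n| > \alpha]}\, dP = 0, \qquad (3)$$

where $I_{[|f_n| > \alpha]}$ is the indicator function which is 1 at points $\omega$ with $|f_n(\omega)| > \alpha$
and is 0 otherwise.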

3 Kalnishkan et al. [5] consider the notion of mixable games, which characterize regular 
games with optimality. In comparison, our conditions are based on integrability of 
the loss function. 



If the sequence $\{f_n\}_{n=1}^{\infty}$ is uniformly integrable, then for every $\epsilon > 0$, and
any large enough $\alpha$,

$$\sup_n \int |f_n|\, dP < \alpha + \epsilon. \qquad (4)$$

In addition to uniform integrability, we also need a continuity requirement over
the space of strategies. We now introduce this. The next lemma characterizes
$H_{1|n}$ in terms of the loss incurred by an optimal strategy on $\Sigma^n$.

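Lemma 5.

$$H_{1|n} = \inf_{f \in \Pi^n} \sum_{w \in \Sigma^n} P(w) \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)) = \sum_{w \in \Sigma^n} P(w) \inf_{f \in \Pi^n} \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)).$$

Proof. Let $n$ be an arbitrary number. For any string $w$ of length $n$, $P(w) \geq 0$,
thus it follows that

$$\inf_{f \in \Pi^n} \sum_{w \in \Sigma^n} P(w) \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)) \geq \sum_{w \in \Sigma^n} P(w) \inf_{f \in \Pi^n} \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)),$$

hence it suffices to prove that the opposite inequality holds.

For each n-long string $w$, let $f_w$ be the function which attains the infimum

$$\inf_{f \in \Pi^n} \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)).$$

Thus, the required expectation of infima can be written in terms of these
functions as

$$\sum_{w \in \Sigma^n} P(w) \inf_{f \in \Pi^n} \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)) = \sum_{w \in \Sigma^n} P(w) \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f_w(w)).$$

We can now define a function $f : \Sigma^n \to \Gamma$ as

$$f(w) = f_w(w), \quad w \in \Sigma^n.$$

It is clear from the definition of the function that

$$\sum_{w \in \Sigma^n} P(w) \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f(w)) = \sum_{w \in \Sigma^n} P(w) \inf_{f' \in \Pi^n} \sum_{a \in \Sigma} P\{a \mid w\}\, \lambda(a, f'(w)),$$

which implies the desired inequality.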
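The interchange of infimum and sum in Lemma 5 can be checked by brute force on a toy instance. The sketch below is illustrative and rests on assumptions not in the paper: predictions are restricted to a five-point grid so that all functions $f : \Sigma^2 \to \Gamma$ can be enumerated, the loss is the square loss, and the joint probabilities `JOINT` are arbitrary hypothetical numbers (the lemma does not need stationarity).

```python
from itertools import product

GRID = [0.0, 0.25, 0.5, 0.75, 1.0]         # finite stand-in for the prediction space
STRINGS = list(product((0, 1), repeat=2))  # all histories w of length n = 2

# Hypothetical joint probabilities P{wa}: weights 1..8 normalised by 36.
JOINT = {
    (w, a): weight / 36
    for weight, (w, a) in zip(range(1, 9), product(STRINGS, (0, 1)))
}


def square_loss(a, gamma):
    return (a - gamma) ** 2


def inner(w, gamma):
    """sum_a P{wa} * lambda(a, gamma), i.e. P(w) * sum_a P{a|w} * lambda(a, gamma)."""
    return sum(JOINT[(w, a)] * square_loss(a, gamma) for a in (0, 1))


# Left-hand side: minimise the total over every function f on the grid.
inf_of_sum = min(
    sum(inner(w, f[i]) for i, w in enumerate(STRINGS))
    for f in product(GRID, repeat=len(STRINGS))
)

# Right-hand side: minimise separately for each history, then add up.
sum_of_inf = sum(min(inner(w, g) for g in GRID) for w in STRINGS)

print(inf_of_sum, sum_of_inf)
```

The two numbers coincide because the choice of $f(w)$ for one history does not constrain the choice for another, which is exactly the construction $f(w) = f_w(w)$ used in the proof.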

Lemma 5 lets us analyse the loss incurred by some "optimal" strategy. From
Lemma 5 we can see that, given $w \in \Sigma^n$, the optimal loss depends on the conditional
probability distribution $(P\{0 \mid w\}, P\{1 \mid w\})$. Let $s(P\{0 \mid w\})$ be the strategy
that gives optimal loss in $H_{1|n}$.

Let us define the following functions on $\Omega$:

$$g_k(\omega) = \lambda\left(\omega_0, s(P\{0 \mid \omega_{-k}^{-1}\})\right)$$
$$g(\omega) = \lambda\left(\omega_0, s(P\{0 \mid \omega_{-\infty}^{-1}\})\right).$$

So, $\mathrm{Loss}(\omega_0^{n-1}, \wp^n) = \sum_{k=0}^{n-1} g_k(T^k \omega)$.

Definition 7. A regular game is strongly regular if

1. $s$ is a continuous function of the conditional probability.

2. For each natural number $N$, define $G_N : \Omega \to [0, \infty]$ by

$$G_N(\omega) = \sup_{k \geq N} |g_k(\omega) - g(\omega)|.$$

We require that $\{G_N\}_{N=1}^{\infty}$ is a uniformly integrable sequence.

First, we explain a consequence of condition (1). For a stationary ergodic distribution
$P$, $P\{0 \mid \omega_{-k}^{-1}\} \to P\{0 \mid \omega_{-\infty}^{-1}\}$ as $k \to \infty$, and since $g_k$ is a continuous
function of the conditional distribution by condition (1), we have that $g_k \to g$
as $k \to \infty$, P-almost everywhere.

We now elicit some consequences of our assumption of uniform integrability. 
For uniformly integrable sequences of functions, their limit function is integrable 
even in the absence of any dominating function. This is known as the Vitali 
Convergence Theorem [5]. 

Theorem 3. Let $(\Omega, \mathcal{F}, P)$ be a probability space. If $\{f_n\}_{n=1}^{\infty}$ is a sequence of
uniformly integrable functions such that $f_n \to f$ P-almost everywhere, then $f$ is
integrable and

$$\lim_{n \to \infty} \int |f_n - f|\, dP = 0.$$

Vitali Convergence of $\{G_N\}_{N=1}^{\infty}$ will be required in the final part of the proof
of Theorem 4. We first show that uniform integrability of $\{G_N\}_{N=1}^{\infty}$ yields the
integrability of the optimal loss.

Lemma 6. For a strongly regular game and a stationary distribution $P$,

$$\lim_{n \to \infty} \int g_n\, dP = \int \lim_{n \to \infty} g_n\, dP = \int g\, dP.$$

Proof. We know that for each $n \in \mathbb{N}$,

$$\int |g_n|\, dP = \int g_n\, dP = H_{1|n},$$

which exists for regular games and stationary distributions. Now, for every $n$,

$$\int |g_n|\, dP = \int |g + (g_n - g)|\, dP \geq \int |g|\, dP - \int |g - g_n|\, dP.$$

Hence we have

$$H = \lim_{n \to \infty} \int |g_n|\, dP \geq \int |g|\, dP - \liminf_{n \to \infty} \int |g - g_n|\, dP. \qquad (5)$$

By the uniform integrability of $\{G_N\}_{N=1}^{\infty}$, we have that

$$\lim_{n \to \infty} \int |g - g_n|\, dP = 0.$$

Thus, by (5), we have $H \geq \int |g|\, dP$.

Using uniform integrability and the notion of continuity, we can introduce 
the setting for our Shannon-McMillan-Breiman Theorem. 

For the sake of convenience, in the following proof, we will consider two-way 
infinite sequences. However, the same theorem holds for one-way sequences as 
well (see Chapter 13 of pQ). We briefly mention the formal correspondence. 

Let $(X, \mathcal{B}, \mu)$ be a measure space with $T$ being a measure preserving transform,
not necessarily invertible. We construct a measure preserving system
$(\tilde{X}, \tilde{\mathcal{B}}, \tilde{\mu}, \tilde{T})$ as follows.

- Define $\tilde{X} = \{(x_i)_{i \in \mathbb{N}} \mid x_i \in T^{-i}X,\ T x_{i+1} = x_i \text{ for all } i \in \mathbb{N}\}$.

- Let $\pi_j : \tilde{X} \to T^{-j}X$ be the projection function which projects onto the $j$-th coordinate
of an element of $\tilde{X}$, i.e., $\pi_j(x) = x_j$. Construct a $\sigma$-algebra $\mathcal{B}'$ generated by
sets of the form $\pi_i^{-1} T^{-i} E$, for all $i \in \mathbb{N}$ and $E \in \mathcal{B}$.

- Let $\tilde{\mu}(\pi_i^{-1} T^{-i} E) = \mu(E)$ for all $E \in \mathcal{B}$.

- Complete $\mathcal{B}'$ with respect to $\tilde{\mu}$ to get $\tilde{\mathcal{B}}$.

- Define $\tilde{T} : \tilde{X} \to \tilde{X}$ by $\tilde{T}((x_i)_{i \in \mathbb{N}}) = (T x_i)_{i \in \mathbb{N}}$.

Clearly, $\tilde{T}$ is an invertible transform given by $\tilde{T}^{-1}(x_1, x_2, x_3, \ldots) = (x_2, x_3, x_4, \ldots)$.
Since $T$ is measure preserving, $\tilde{T}$ is also measure preserving. $(\tilde{X}, \tilde{\mathcal{B}}, \tilde{\mu}, \tilde{T})$ is called the
natural extension of $(X, \mathcal{B}, \mu, T)$. It is ergodic iff the original system is ergodic.
For a unilateral alphabet system, its natural extension has the same entropy. For
details, see Fact 4.3.2 of [3].

Theorem 4. For a strongly regular game $(\Sigma, \Gamma, \lambda)$, and stationary ergodic distribution
$(\Sigma^\infty, \mathcal{F}, P)$, let $H$ be the generalized entropy of the game. Moreover,
let $\wp$ be a strategy such that for every $n$, $\wp^n$ achieves $H_n$. Then the
following holds:

$$\lim_{n \to \infty} \frac{\mathrm{Loss}(\omega_0^{n-1}, \wp^n)}{n} = H \qquad (6)$$

for P-almost every $\omega \in \Omega$.

We cannot use Birkhoff's ergodic theorem (see for example, [1]) directly
to prove the above theorem, since the summands in the Birkhoff average on the
left of (6) depend in general on $n$, and are not the same integrable function. We
can, however, use the convergence in conditional distributions ensured by a stationary
distribution, in conjunction with Birkhoff's ergodic theorem, to establish
our result.



Proof. Recall that $g_k \to g$ almost everywhere, and $\int g\, dP$ exists by Lemma 6. We
know $\mathrm{Loss}(\omega_0^{n-1}, \wp^n) = \sum_{k=0}^{n-1} g_k(T^k \omega)$.

Since $T$ is a measure preserving transformation, by change of variable,

$$\int_\Omega g_k(\omega)\, dP = \int_\Omega g_k(T^k \omega)\, dP = H_{1|k}.$$

Thus

$$\int g(\omega)\, dP = \lim_{n \to \infty} \int g_n(\omega)\, dP = \lim_{n \to \infty} H_{1|n} = H.$$

By the Ergodic theorem, we get

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g(T^k \omega) = \int g(\omega)\, dP = H,$$

for P-almost every $\omega \in \Omega$.
Now,

$$\frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k \omega) = \frac{1}{n} \sum_{k=0}^{n-1} g(T^k \omega) + \frac{1}{n} \sum_{k=0}^{n-1} \left( g_k(T^k \omega) - g(T^k \omega) \right),$$

where the first term tends to $H$ as $n \to \infty$. If we show that the second term in the
previous equation tends to 0 a.e. as $n \to \infty$, we are done.

Define $G_N(\omega) = \sup_{k \geq N} |g_k(\omega) - g(\omega)|$. By the assumption of strong regularity,
the sequence of functions $\{G_N\}_{N=1}^{\infty}$ is uniformly integrable. Also, since
$g_n \to g$ P-a.e., we know that $G_N \to 0$ P-almost everywhere as $N \to \infty$. By the
Vitali Convergence Theorem,

$$\lim_{N \to \infty} \int G_N\, dP = \int \lim_{N \to \infty} G_N\, dP = 0.$$

Now for each $N$,

$$\limsup_{n \to \infty} \left| \frac{1}{n} \sum_{k=0}^{n-1} \left( g_k(T^k \omega) - g(T^k \omega) \right) \right| \leq \limsup_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \left| g_k(T^k \omega) - g(T^k \omega) \right|$$

$$\leq \limsup_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} G_N(T^k \omega) = \int G_N(\omega)\, dP,$$

where the last equality follows from the Birkhoff Ergodic Theorem. Note that this
holds for all values of $N$ and the right side converges to 0 as $N \to \infty$. Since
the left side is non-negative, it is 0 a.e. So, $\frac{1}{n} \sum_{k=0}^{n-1} \left( g_k(T^k \omega) - g(T^k \omega) \right) \to 0$ as
$n \to \infty$. This concludes the proof.

Recall that the generalized entropy of the log-loss game is the Shannon en- 
tropy. We now show the square loss and the log-loss games are strongly regular, 
thus establishing that we have a proper generalization of the classical Shannon- 
McMillan-Breiman theorem. 



Example 2. Log-loss Game. The loss function $\lambda : \{0, 1\} \times [0, 1] \to [0, \infty]$ is defined
by

$$\lambda(b, \gamma) = -\log(|\bar{b} - \gamma|).$$

The optimal strategy is given by $s_k = P\{1 \mid \omega_{-k}^{-1}\} = 1 - P\{0 \mid \omega_{-k}^{-1}\}$, which is a continuous
function of the conditional probability.
We have that for any $N$,

$$\int \sup_{k \geq N} |g_k(\omega) - g(\omega)|\, dP \leq \int \sup_{n \geq 1} |g_n(\omega) - g(\omega)|\, dP \leq \int \sup_{n \geq 1} |g_n(\omega)|\, dP + \int g\, dP.$$

Hence to show that the sequence $\sup_{k \geq N} |g_k(\omega) - g(\omega)|$ is uniformly integrable,
it suffices to show that

$$\sup_{n \geq 1} |g_n(\omega)|$$

is integrable. It is easy to show that for a stationary distribution $P$ and any
$r \in \mathbb{R}$,

$$P\{\omega \mid \sup_k |g_k(\omega)| > r\} \leq 2e^{-r},$$

from which the integrability of $\sup_k g_k$ follows.

Thus $\sup_{k \geq N} |g_k - g|$, for $N = 1, 2, \ldots$, forms a uniformly integrable sequence
of functions, and Theorem 4 holds for the log-loss game.

Example 3. Square-loss game. The loss function in the square-loss game, $\lambda :
\{0, 1\} \times [0, 1] \to [0, 1]$, is defined by

$$\lambda(b, \gamma) = (b - \gamma)^2. \qquad (7)$$

The optimal strategy in the square-loss game is to pick $\gamma = P\{1 \mid \omega_{-k}^{-1}\}$, which
is continuous in the conditional probability.
This loss function is bounded, hence

$$\int \sup_{k \geq 1} |g_k(\omega) - g(\omega)|\, dP \leq \int 1\, dP = 1,$$

ensuring that $G_N = \sup_{k \geq N} |g_k(\omega) - g(\omega)|$ is uniformly integrable. Thus Theorem
4 holds for the square-loss game.
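The convergence asserted for the square-loss game can be observed in simulation. The sketch below is an illustration under assumptions that are not in the paper: the source is a hypothetical two-state ergodic Markov chain (`T`, `PI`), for which conditioning on the whole past reduces to conditioning on the previous bit, and the sample size and random seed are arbitrary.

```python
import random

T = [[0.9, 0.1], [0.4, 0.6]]  # hypothetical transition probabilities
PI = [0.8, 0.2]               # stationary distribution of T


def simulate_average_loss(n: int, seed: int = 0) -> float:
    """Average square loss of the predictor gamma = P{next bit = 1 | past}."""
    rng = random.Random(seed)
    bit = 1 if rng.random() < PI[1] else 0
    gamma = PI[1]  # no history yet: predict the stationary probability of 1
    total = 0.0
    for _ in range(n):
        total += (bit - gamma) ** 2
        gamma = T[bit][1]  # Markov source: the last bit determines the conditional law
        bit = 1 if rng.random() < T[bit][1] else 0
    return total / n


# For this chain the one-step entropies H_{1|n} are constant for n >= 1, so the
# generalized entropy is the stationary average of p(1 - p) over the last bit.
entropy_rate = sum(PI[s] * T[s][0] * T[s][1] for s in (0, 1))
average = simulate_average_loss(200_000)
print(entropy_rate, average)
```

The empirical loss rate on a single long sample path lands close to the generalized entropy, which is the almost-sure statement of the theorem in this special case.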



6 Predictive Complexity of Stationary Ergodic Games 

We now consider computable prediction strategies. We would like to define the 
inherent unpredictability of a string x as the performance of an optimal com- 
putable predictor on x. It is not clear that one such predictor exists for any 
game. The work of Vovk and Watkins [15] establishes a sufficient condition for
predictive complexity to exist. 



Definition 8. A pair of points $(s_0, s_1) \in (-\infty, \infty]^2$ is called a superscore (see footnote 4) if
there is a prediction $\gamma \in \Gamma$ such that $\lambda(0, \gamma) \leq s_0$ and $\lambda(1, \gamma) \leq s_1$. We denote
the set of superscores for a regular game $\mathcal{G}$ by $S$.

Definition 9. A prediction strategy $p : \Sigma^* \to (-\infty, \infty]$ is called a superloss
process if the following conditions hold.

1. $p(\Lambda) = 0$, where $\Lambda$ is the empty string.

2. For every string $x$, the pair $(p(x0) - p(x), p(x1) - p(x))$ is a superscore with
respect to the game.

3. $p$ is upper semicomputable.

A superloss process $\mathcal{K}$ is universal if for any superloss process $p$ there is a
constant $C$ such that for every string $x$,

$$\mathcal{K}(x) \leq p(x) + C.$$

It follows that the difference in loss between any two universal superloss processes is
bounded by a constant. Hence we may pick a particular universal superloss process $\mathcal{K}$
and call $\mathcal{K}(x)$ the predictive complexity of the string $x$ with respect to the game
$\mathcal{G}$.

When we consider regular games, it is not necessary that an optimal strategy 
exists on S* which incurs at most an additive loss when compared to any other 
prediction process. However, Vovk [14] and Vovk and Watkins [15] introduced
the concept of mixability to ensure that one such universal process exists. 

Definition 10. Let $\beta \in (0, 1)$. Consider the homeomorphism $h_\beta : (-\infty, \infty]^2 \to
[0, \infty)^2$ specified by $h_\beta(x, y) = (\beta^x, \beta^y)$. A regular game $\mathcal{G}$ with set of superscores
$S$ is called $\beta$-mixable if the set $h_\beta(S)$ is convex. A game $\mathcal{G}$ is called mixable if
it is $\beta$-mixable for some $\beta \in (0, 1)$.
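The convexity condition in Definition 10 can be probed numerically. The sketch below is a grid-based sanity check and not a proof; all names are hypothetical. It uses the observation that $h_\beta$ is decreasing in each coordinate, so $h_\beta(S)$ is the set of points lying coordinate-wise below some boundary point $(\beta^{\lambda(0,\gamma)}, \beta^{\lambda(1,\gamma)})$, and it tests whether midpoints of boundary points stay in that set. The log-loss with natural logarithm is tested at $\beta = 1/e$ and the absolute loss at the single value $\beta = 1/2$, both chosen for illustration.

```python
import math
from itertools import combinations


def log_loss(b, gamma):
    """Natural-log loss, gamma being the weight placed on bit 1."""
    mass = gamma if b == 1 else 1.0 - gamma
    return math.inf if mass <= 0.0 else -math.log(mass)


def abs_loss(b, gamma):
    return abs(b - gamma)


FINE = [k / 1000 for k in range(1001)]  # predictions used for membership tests
COARSE = [k / 10 for k in range(11)]    # predictions whose midpoints are tested


def boundary(loss, beta, grid):
    """Images under h_beta of the superscores (loss(0, g), loss(1, g))."""
    return [(beta ** loss(0, g), beta ** loss(1, g)) for g in grid]


def looks_convex(loss, beta, tol=1e-9):
    """True if every tested midpoint of boundary points lies in h_beta(S)."""
    fine = boundary(loss, beta, FINE)
    for (u1, v1), (u2, v2) in combinations(boundary(loss, beta, COARSE), 2):
        mu, mv = (u1 + u2) / 2, (v1 + v2) / 2
        if not any(mu <= u + tol and mv <= v + tol for u, v in fine):
            return False
    return True


log_ok = looks_convex(log_loss, math.exp(-1))
abs_ok = looks_convex(abs_loss, 0.5)
print(log_ok, abs_ok)
```

For the log-loss the boundary is the segment $u + v = 1$, so every midpoint passes; for the absolute loss at $\beta = 1/2$ the boundary is the hyperbola $uv = 1/2$, and the midpoint of its endpoints $(1, 1/2)$ and $(1/2, 1)$ falls outside the image.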

Theorem 5 ([15]). If a game $\mathcal{G}$ with set of superscores $S$ is mixable, then $\mathcal{G}$ has
a predictive complexity.

It is known that the log-loss and the square-loss games are mixable. The
coincidence of log-loss and Kolmogorov complexity enables us to view predictive
complexity as a generalization of Kolmogorov complexity. The absolute-loss game is
known not to be mixable [17].

We mention a loss bound which holds for mixable games. This is used in the 
proof of the theorem which follows. 

Lemma 7 ([17]). If $\mathcal{K}$ is the predictive complexity of a mixable game $\mathcal{G}$, then there
is a positive constant $c$ such that $|\mathcal{K}(xb) - \mathcal{K}(x)| \leq c \ln n$ for all $n = 1, 2, \ldots$,
strings $x$ of length $n$ and bits $b$.

We can now show that for a strongly regular mixable game $\mathcal{G}$, the predictive
complexity rate on an infinite sequence of outcomes attains the generalized
entropy of the stationary ergodic distribution P, almost everywhere. 

4 In [10], [9], the concept is called a superprediction. 



Theorem 6. Let $\mathcal{G} = (\Sigma, \Gamma, \lambda)$ be a strongly regular mixable game with predictive
complexity $\mathcal{K}$. Let $(\Omega, \mathcal{F}, P)$ be the probability space over the outcomes where
$P$ is a stationary ergodic distribution with generalized entropy $H$. Then

$$\lim_{n \to \infty} \frac{\mathcal{K}(\omega_0^{n-1})}{n} = H$$

for P-almost every $\omega \in \Omega$.

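Proof. (A) Upper Bound: First we show that $\lim_{n \to \infty} \frac{\mathcal{K}(\omega_0^{n-1})}{n} \leq H + \epsilon$ for
any $\epsilon > 0$. This is an application of our Shannon-McMillan-Breiman theorem,
Theorem 4, for generalized entropy.

Let $\wp^n$ be the strategy which achieves $H_{1|n}$. There is a computable strategy
$C$ so that for all $0 \leq i \leq n - 1$,

$$\lambda(a, C_i(w)) \leq \lambda(a, \wp_i(w)) + \frac{\epsilon}{2}$$

for all $a \in \Sigma$ and for all $w \in \Omega$. This is possible since the set of all such strategies
constitutes an open set. By the definition of predictive complexity, we have

$$\mathcal{K}(\omega_0^{n-1}) \leq \mathrm{Loss}(\omega_0^{n-1}, C) + O(1)$$

$$\leq \mathrm{Loss}(\omega_0^{n-1}, \wp^n) + \frac{n\epsilon}{2} + O(1).$$

By the Shannon-McMillan-Breiman Theorem, for large enough $n$,

$$\mathrm{Loss}(\omega_0^{n-1}, \wp^n) + \frac{n\epsilon}{2} + O(1) \leq nH + \epsilon\, O(n) + \frac{n\epsilon}{2} + O(1).$$

Taking limits as $n \to \infty$, and since $\epsilon > 0$ is arbitrary, we have that

$$\lim_{n \to \infty} \frac{\mathcal{K}(\omega_0^{n-1})}{n} \leq H.$$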



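(B) We now establish the reverse inequality, $\lim_{n \to \infty} \frac{\mathcal{K}(\omega_0^{n-1})}{n} \geq H - \epsilon$ for
$\epsilon > 0$. Since

$$\left( \mathcal{K}(\omega_0^{n-1} \cdot 0) - \mathcal{K}(\omega_0^{n-1}),\ \mathcal{K}(\omega_0^{n-1} \cdot 1) - \mathcal{K}(\omega_0^{n-1}) \right)$$

is a superscore, we have $E(\eta_n \mid \omega_0^{n-1}) \geq H_{1|n}$ where $\eta_n = \mathcal{K}(\omega_0^{n}) - \mathcal{K}(\omega_0^{n-1})$.

Now we can apply the martingale strong law of large numbers, Theorem
VII.5.4 of [13], and get

$$\frac{\mathcal{K}(\omega_0^{n-1})}{n} = \frac{1}{n} \sum_{i=0}^{n-1} \eta_i \geq \frac{1}{n} \sum_{i=0}^{n-1} H_{1|i} + o(1) = H + o(1),$$

where the last equality is obtained by Theorem 2.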